Contents filter based on the comparison between similarity of content character and correlation of subject matter

ABSTRACT

A contents filter based on similarity of content character and correlation of subject matter includes a filtering system and a disciplining system. The contents filter is not a filtering system for one specified subject matter but for general subject matter, and the filtering contents can be obtained through learning by the disciplining system. The filtering system and the disciplining system are installed physically separately, and the filtering system communicates with the disciplining system through a data interface. The filtering system can be installed in an input device of network information. To achieve different filtering effects, the different filtering characters obtained by the disciplining system are set to the filtering systems located in different input devices of network information. The present invention implements filtering by analyzing and determining text contents, and offers users an intelligent and effective contents safety service. The filter is highly flexible in use. Furthermore, the filter can identify the content character to be filtered according to the character of the disciplined class provided by the user. The processing speed is fast and the filter can be conveniently installed.

FIELD OF THE INVENTION

[0001] The present invention relates to a filter system for text contents information in the field of Chinese information processing, and particularly relates to a filter of text character analysis based on similarity of contents and correlation of subject matter. It belongs to the field of computer information technology.

BACKGROUND OF THE INVENTION

[0002] The rapid development of computer and network technology and the popularization of the Internet have made the network an important way for people to get information.

[0003] The amount of information on the network is very large, and unhealthy contents and undesired information are increasing; these bring harmful effects and a heavy economic burden. At present, the problem of young people coming into contact with unhealthy contents through the Internet has attracted great attention from all circles of society. On the other hand, some information that relates to the stability of society or violates morality also influences normal social life. Filtering network information is therefore one of the major effective means of preventing the spread of large amounts of information that violate the public interest.

[0004] At present, some existing network information contents filters work on the principle of key word matching. This method filters well the contents that exist directly and without disguise in the information, but it does not function properly on treated contents or on contents with interfering information. This is an obvious limitation of the traditional method based on key word matching.

[0005] To make up for the rigidity and limitation of the method based on key word matching, there are methods that extract the filtering character through disciplining and then transmit the filtering character to the filtering system as the filtering rule. The advantage of such a method is that it overcomes the unsuitability of key word matching for information that contains interferential information.

[0006] But a device using this method fixes the disciplining system and the filtering system together, with the following disadvantage: because each parameter used for filtering is generated by the disciplining system, the disciplining system is generally large and powerful, while for the sake of flexibility the filtering system is small so that it can be set up in many kinds of systems. The prior art of binding the disciplining system and the filtering system together reduces the flexibility of the filtering system and at the same time restrains the power of the disciplining system.

SUMMARY OF THE INVENTION

[0007] The main object of the present invention is to provide a contents filter based on similarity of content character and correlation of subject matter. By separating the disciplining system from the filtering system, this contents filter is made more flexible in the analysis and determination of text contents, and offers users an intelligent and effective contents safety service.

[0008] Another object of the present invention is to provide a contents filter based on similarity of content character and correlation of subject matter that is not a filtering system for one specified subject matter but for general subject matter. The filtering contents can be obtained by learning, which makes the filter more flexible for users.

[0009] Another object of the present invention is to provide a contents filter based on similarity of content character and correlation of subject matter in which the filter identifies the content character to be filtered according to the character of the disciplined class supplied by the user; the contents will be filtered if the similarity of content character is beyond the preset threshold value.

[0010] Another object of the present invention is to provide a contents filter based on similarity of content character and correlation of subject matter whose processing speed is fast and which can be conveniently installed.

[0011] The objects of the present invention are accomplished as follows:

[0012] A contents filter based on similarity of content character and correlation of subject matter includes at least a filtering system and a disciplining system; the disciplining system learns from the appointed information to obtain the filtering character of the information; the filtering system filters the information; and the disciplining system communicates with the filtering system.

[0013] The contents filter may include one disciplining system and one or more filtering systems; it may also include one filtering system and one or more disciplining systems, or multiple filtering systems and multiple disciplining systems.

[0014] The filtering system and the disciplining system are installed physically separately. The filtering system communicates with the disciplining system through a data interface, and the filtering system can be set in an input device of network information.

[0015] To further enhance the filtering effect for different targets, the different information obtained by the disciplining system is configured to the filtering systems in the different input devices of network information respectively.

[0016] The mentioned configuration means that the disciplining system distributes the filtering character to each filtering system according to the load capacity, the location and the purpose of its input device of network information in the network. The input device of network information is a firewall, a mail server, a proxy server or a personal computer; there can be one such device, several such devices, or a combination of any types of such devices.

[0017] In detail, the disciplining system includes a module of classifying character vocabulary for contents filtering, which is used to construct a classifying character vocabulary learned from the specified information and to supplement or update that vocabulary. The classifying character vocabulary is obtained from the specified learning information by this module. Once the vocabulary is constructed, the disciplining system transfers the vocabulary contents to the filtering system through the standard data interface; the filtering system then follows the vocabulary to perform the filtering action, and in this way the disciplining system instructs the filtering system's action.

[0018] The disciplining system further includes an anti-interference extracting module of text character for contents filtering. This module is used to examine the checked information, obtain the interferential text in it, and then instruct the text filtering actions of the filtering system. First, the module finds the specified text information in the checked text contents and checks whether the sequence of the specified text contents accords with the sequence of the preset text; it then determines the interferential distance between the specified text information and the checked text contents. If the distance is less than the preset threshold distance, the text contents are set as the interferential text contents to be selected.

[0019] The disciplining system further includes an anti-interference extracting module of text subject matter. The procedure of extracting the anti-interference subject matter words includes the following steps:

[0020] Step 1: The anti-interference extracting module of text subject matter examines the specified characters in the checked text to determine whether their sequence accords with the sequence of the preset characters in the preset subject matter words, i.e. it finds the specified character string;

[0021] Step 2: The anti-interference extracting module of text subject matter determines the interferential distance; if the interferential distance is less than the preset threshold distance, the string is considered as the interferential subject matter words to be selected;

[0022] Step 3: When the anti-interference extracting module of text subject matter concludes that the frequency of appearance of the mentioned subject matter words is beyond the preset threshold value, the subject matter words to be selected are set as the key words of the filter.

[0023] The examination of the specified characters by the anti-interference extracting module of text subject matter further includes checking whether the specified characters contain Chinese punctuation among them. If they do not contain Chinese punctuation, the character string is an interferential subject matter word, and the module will consider the character string as a key word of the filter.

[0024] The said anti-interference extracting module of text subject matter can examine the specified character string between two adjacent punctuation marks.

[0025] In detail, the frequency of appearance of the interferential subject matter word to be selected can be a summation over interferential subject matter words of several different types.

[0026] The anti-interference extracting module of text subject matter is used to extract the information relating to text subject matter; the extracted information is then rectified, and finally the similarity of text based on the Vector Space Model is rectified according to the rectified result of the subject matter information.

[0027] The process of rectifying the similarity of text based on the Vector Space Model according to the rectified result of the subject matter information includes the following steps:

[0028] Extracting the information relating to the subject matter of the text, in detail the frequency of word, the concentration of frequency, the information of word length, the words and the total words amount; choosing the information with top weight as the information relating to subject matter; rectifying the extracted information relating to subject matter; and then rectifying the similarity of text based on the Vector Space Model according to the rectified result.

[0029] The extracting of the subject matter information by the anti-interference extracting module of text subject matter is performed with the following formula: $w_{ik} = \left( K_{1} + K_{1} \times \frac{tf}{MAX\,tf} \right) \times \frac{1}{\log_{2}\frac{T_{w}}{tf}} \times \left( K_{2} + K_{2} \times \frac{w_{l}}{MAX\,w_{l}} \right),$

[0030] in which the first factor, $K_{1} + K_{1} \times \frac{tf}{MAX\,tf}$, stands for the factor of the frequency of word; the second factor, $\frac{1}{\log_{2}\frac{T_{w}}{tf}}$, stands for the factor of the concentration of frequency; the third factor, $K_{2} + K_{2} \times \frac{w_{l}}{MAX\,w_{l}}$, stands for the factor of word length; w_(ik) stands for the weight of the word k in text i; tf stands for the frequency of the word k in text i; MAXtf stands for the word frequency of the word with maximum frequency; K₁ stands for the grade of importance to tf, commonly set to 0.5; MAXw_(l) stands for the maximum value of the word length w_(l) in the text; K₂ stands for the grade of importance to w_(l), commonly set to 0.5; T_(w) stands for the amount of total words (considering the character words only).
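
By way of illustration only, the weighting formula can be sketched in Python as follows. The function and parameter names are hypothetical, and the sketch assumes that the concentration factor reads 1/log₂(T_w/tf), which requires 0 < tf < T_w so that the logarithm is positive.

```python
import math


def term_weight(tf, max_tf, total_words, word_len, max_word_len, k1=0.5, k2=0.5):
    """Weight w_ik of one word in one text (illustrative sketch).

    tf            frequency of the word in the text
    max_tf        frequency of the most frequent word (MAXtf)
    total_words   total amount of character words in the text (T_w)
    word_len      length of the word (w_l)
    max_word_len  maximum word length in the text (MAXw_l)
    k1, k2        importance grades, 0.5 as commonly set in the description
    """
    if not 0 < tf < total_words:
        # Assumption of this sketch: log2(T_w / tf) must be strictly positive.
        raise ValueError("need 0 < tf < total_words")
    frequency_factor = k1 + k1 * tf / max_tf
    concentration_factor = 1.0 / math.log2(total_words / tf)
    length_factor = k2 + k2 * word_len / max_word_len
    return frequency_factor * concentration_factor * length_factor


# A word occurring 4 times among 64 words, half as frequent and half as long
# as the respective maxima: 0.75 * (1 / 4) * 0.75
print(term_weight(tf=4, max_tf=8, total_words=64, word_len=2, max_word_len=4))
```

With K₁ = K₂ = 0.5, the frequency and length factors each range from 0.5 to 1, so neither can suppress a word entirely; the concentration factor carries most of the variation.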

[0031] Rectifying the extracted information relating to subject matter means determining the similarity of contents according to the degree of overlapping of subject matter information.

[0032] The rectification of the similarity of text based on the Vector Space Model is as follows: if the degree of overlapping is more than the threshold value, the value of eigenvector similarity will be strengthened, and if the degree of overlapping is less than the threshold value, the value of eigenvector similarity will be weakened.

[0033] The rectification of the information relating to subject matter is performed according to the following: $R_{is} = A + \frac{T_{is} \cap C_{s}}{C_{s}},$

[0034] wherein A is an empirical value reflecting the degree of importance paid to the subject matter word (0<A<1); R_(is) is the correlation coefficient of the subject matter words; T_(is) is the subject matter words amount of the text to be analyzed; C_(s) is the subject matter words amount of the standard class; and “∩” stands for the calculation of intersection.

[0035] The rectification of the similarity of text based on the Vector Space Model is as follows:

Sim(W_(i),V_(j))×R_(is),

[0036] in which Sim(W_(i),V_(j)) is the similarity of text based on the Vector Space Model.
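
As a minimal sketch under stated assumptions, the subject matter rectification may be expressed as below. The function names are hypothetical, and the ratio in the formula for R_(is) is read here as the number of subject matter words shared by the text and the standard class, divided by the number of subject matter words of the standard class.

```python
def subject_correlation(text_subject_words, class_subject_words, a=0.5):
    """Correlation coefficient R_is = A + |T_is ∩ C_s| / |C_s| (assumed reading)."""
    class_set = set(class_subject_words)
    if not class_set:
        raise ValueError("the standard class needs at least one subject matter word")
    overlap = len(set(text_subject_words) & class_set) / len(class_set)
    return a + overlap


def rectify_by_subject(sim, r_is):
    """Rectified similarity Sim(W_i, V_j) x R_is."""
    return sim * r_is


class_words = ["w1", "w2", "w3", "w4"]
r_high = subject_correlation(["w1", "w2", "w3", "w4"], class_words)  # 0.5 + 1.0
r_low = subject_correlation(["other"], class_words)                  # 0.5 + 0.0
print(rectify_by_subject(0.8, r_high), rectify_by_subject(0.8, r_low))
```

Under this reading the multiplier exceeds 1 exactly when the overlap exceeds 1 − A, so 1 − A plays the role of the overlap threshold above which the similarity is strengthened and below which it is weakened.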

[0037] In addition, a distinguishing feature of the present invention is that the disciplining system further includes an evaluation and instruction module of disciplining effect.

[0038] The evaluation and instruction module of disciplining effect is used to obtain the evaluation coefficients of the character words amount, of the rate of repeat and of the degree of subject matter centralization; from these coefficients the result of the disciplining effect is derived, giving an objective and quantitative instruction to disciplining.

[0039] The evaluation of the character words amount is as follows: $Q_{1} = \begin{cases} 1 & \text{when } x_{i} < \alpha_{i} \\ \frac{A - x_{i}}{A - \alpha_{i}} & \text{when } x_{i} > \alpha_{i} \end{cases},$

[0040] in which x_(i) stands for the character words in the text of disciplining, A stands for the total amount of the character words, and α_(i) is an empirical threshold value of the character words amount for each disciplining evaluation point.

[0041] The evaluation of the rate of repeat is as follows: $Q_{2} = \begin{cases} x_{i}/\beta & \text{when } x_{i} < \beta \\ 1 & \text{when } x_{i} > \beta \end{cases},$

[0042] in which x_(i) stands for the mean rate of repeat and β is an empirical threshold value.

[0043] The evaluation of the degree of subject matter centralization is as follows: $Q_{3} = \begin{cases} x_{i}/\chi & \text{when } x_{i} < \chi \\ 1 & \text{when } x_{i} > \chi \end{cases},$

[0044] in which x_(i) stands for the maximum overlapping rate of document and χ is an empirical threshold value.

[0045] The final evaluation of disciplining is as follows: Q = Q1 * Q2 * Q3, or Q = Q1 * Q2, or Q = Q1 * Q3, or Q = Q2 * Q3, or Q = Q1, or Q = Q2, or Q = Q3.

[0046] Then the grade of disciplining effect is determined according to the value of Q.
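
The three evaluation coefficients and their product can be sketched as follows. This is an illustration only; the function names are hypothetical, and the sketch assumes that each piecewise definition takes the value 1 at the boundary point itself, where both branches agree.

```python
from math import prod


def q_word_amount(x, alpha, total):
    """Q1: 1 up to the threshold alpha, then (A - x) / (A - alpha)."""
    if x <= alpha:
        return 1.0
    return (total - x) / (total - alpha)


def q_capped_ratio(x, threshold):
    """Shared shape of Q2 and Q3: x / threshold below the threshold, 1 above it."""
    return min(x / threshold, 1.0)


def disciplining_quality(*coefficients):
    """Q as the product of whichever of Q1, Q2, Q3 are selected."""
    return prod(coefficients)


q1 = q_word_amount(x=550, alpha=100, total=1000)  # (1000 - 550) / (1000 - 100)
q2 = q_capped_ratio(x=0.2, threshold=0.4)         # mean rate of repeat vs. beta
q3 = q_capped_ratio(x=0.9, threshold=0.4)         # max overlapping rate vs. chi
print(disciplining_quality(q1, q2, q3))
```

Because every coefficient lies between 0 and 1 for inputs in the expected ranges, the product can only lower Q, so a single weak coefficient is enough to lower the grade of disciplining effect.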

[0047] In addition, the filtering system includes a module of classifying character vocabulary for contents filtering, an anti-interference extracting module of text character, and a module of calculating similarity between the text contents to be filtered and the defined filtering contents. The filtering system further includes a module of rectifying local similarity and short text similarity with precision.

[0048] The module of rectifying local similarity and short text similarity with precision is used to obtain, according to the standard vector of the text to be analyzed, the precision with which the text to be analyzed is assigned to the standard class it belongs to, and to rectify the result of the similarity of text based on the Vector Space Model with the said precision.

[0049] The method of rectification can be Sim(W_(i),V_(j))×P_(i), in which P_(i) stands for the rectifying coefficient of precision.

[0050] The method of obtaining the rectifying coefficient of precision is described as follows: $P_{i} = B\sqrt{\frac{\sum \left( \sigma_{k}\nu_{jk} \right)^{2}}{\sum \left( \nu_{jk} \right)^{2}}}$

[0051] in which B≧1 and $\sigma_{k} = \begin{cases} 1 & \text{when } w_{jk} > 0 \\ 0 & \text{when } w_{jk} = 0 \end{cases}$

[0052] and B is an empirical value of the grade of importance given to the precision information.
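
A possible reading of the precision coefficient is sketched below; it is not a definitive implementation. The function name is hypothetical, and the sketch assumes that σ_(k) tests whether component k of the vector of the text to be analyzed is non-zero, so that P_(i) measures how much of the standard vector's length lies on words that actually occur in the text.

```python
import math


def precision_coefficient(text_vec, class_vec, b=1.0):
    """P_i = B * sqrt(sum((sigma_k * v_jk)^2) / sum(v_jk^2)), with B >= 1."""
    if len(text_vec) != len(class_vec):
        raise ValueError("vectors must have the same dimension")
    # sigma_k is 1 where the text has the word, 0 where it does not (assumed reading).
    covered = sum(v * v for w, v in zip(text_vec, class_vec) if w > 0)
    total = sum(v * v for v in class_vec)
    if total == 0:
        raise ValueError("standard vector must be non-zero")
    return b * math.sqrt(covered / total)


# The text contains the first and third words; only the first carries class weight.
print(precision_coefficient([1, 0, 2, 0], [3, 4, 0, 0]))  # sqrt(9 / 25)
```

A short text that matches only a small part of the standard class thus receives a coefficient well below B, which reduces its rectified similarity.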

[0053] The filtering system includes a filtering module according to multi-step rectified degree of similarity, which is used to gather the coefficients of precision obtained by each module and, with the preset filtering threshold value U_(w), to determine whether the text to be filtered should be filtered.

[0054] The present invention implements the contents filtering by analyzing and determining text contents, and offers an intelligent and effective contents safety service. The contents filter is not a filtering system for one specified subject matter but for general subject matter, and the filtering contents can be obtained by learning. The present invention is also more flexible for the user of the filter. Besides, the filter can identify the character of the contents to be filtered with the character of the disciplined class; if the similarity of character is beyond the threshold value, the contents will be filtered. Its processing speed is fast and the filter can be conveniently installed.

DESCRIPTION OF DRAWINGS

[0055] FIG. 1 is a schematic diagram showing the structure of the disciplining system and the filtering system of the present invention;

[0056] FIG. 2 is a schematic diagram showing one embodiment of the present invention;

[0057] FIG. 3 is a schematic diagram showing the other embodiment of the present invention;

[0058] FIG. 4 is a schematic diagram showing another embodiment of the present invention;

[0059] FIG. 5 is a schematic diagram showing the filtering system of the present invention;

[0060] FIG. 6 is a schematic diagram showing the disciplining system of the present invention;

[0061] FIG. 7 is a flowchart showing the extraction of the anti-interference subject matter words of the present invention;

[0062] FIG. 8 is a flowchart showing the calculation of degree of text similarity based on the Vector Space Model according to the rectified result of subject matter information of the present invention;

[0063] FIG. 9 is a schematic diagram showing the learning processing module of the disciplining system of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

[0064] The contents filter based on similarity of content character and correlation of subject matter of the present invention implements the contents filtering through an analysis and determination of text contents, and offers an intelligent and effective contents safety service.

[0065] As shown in FIG. 1, a distinguishing feature of the present invention is that it provides a notional model of disciplining-filtering system construction.

[0066] Universal and unrestricted text contents are filtered by the contents filter. When making a filtering request for texts similar to specified contents, the user first makes the filter obtain the relevant knowledge of the specified contents through learning, and then delivers the knowledge to the filter, which uses it to filter. “Disciplining” means this procedure of automatic learning. The filter identifies the character of the contents to be filtered with the disciplined classifying character offered by the user. If the similarity of content character is beyond the preset threshold value, the contents will be filtered.

[0067] With the notional model of the disciplining-filtering system, the contents to be filtered are open to the user, which makes the contents filter a universal filtering system that is not restricted to one specified subject matter.

[0068] The mentioned contents filter includes a filtering system and a disciplining system; the disciplining system learns from preset information and obtains the filtering character of the information, the filtering system filters the information, and the disciplining system communicates with the filtering system. In the present embodiment, the contents filter includes multiple filtering systems and multiple disciplining systems. The contents filter can equally include only one disciplining system and one or more filtering systems, or one filtering system and one or more disciplining systems. Whatever the number of filtering systems and disciplining systems, the filtering system and the disciplining system are set physically separately.

[0069] The filtering system is set in the input device of network information, and the different filtering characters obtained by the disciplining system are configured to the filtering systems in the different input devices of network information. The said configuration means that the disciplining system distributes the filtering character to each filtering system according to the load capacity, the location and the purpose of its input device of network information in the network.

[0070] The input device of network information can be a firewall, a mail server, a proxy server or a personal computer; there can be one or more such devices, or a combination of any types of such devices.

[0071] The disciplining system includes a module of classifying character vocabulary for contents filtering, which is used to construct a classifying character vocabulary learned from the specified information and to supplement or update that vocabulary. The classifying character vocabulary is obtained from the specified learning information by this module. Once the vocabulary is constructed, the disciplining system transfers the vocabulary contents to the filtering system through the standard data interface; the filtering system then follows the vocabulary to perform the filtering action, and in this way the disciplining system instructs the filtering system's action.

[0072] The disciplining system further includes an anti-interference extracting module of text character for contents filtering, which is used to examine the checked information, obtain the interferential text in it, and instruct the text filtering actions of the filtering system. First, this module finds the specified text information in the checked text and determines whether the sequence of the specified text contents accords with the sequence of the preset text; it then determines the interferential distance between the specified text information and the checked text contents. If the distance is less than the preset threshold distance, the text contents are set as the interferential text contents to be selected.

[0073] As shown in FIG. 2, FIG. 3 and FIG. 4, the system structure of the present invention is a working mode in which the disciplining system and the filtering system are installed separately.

[0074] According to the definition of the notional model of the disciplining-filtering system, a contents filter system is separated into two modules: the disciplining system and the filtering system. The filtering system in the contents filter can be installed in the input device of network information (such as a firewall, mail server, proxy server etc.). There it responds to the identifying requests of system content safety, scans the unknown text content in real time, determines the similarity between the unknown text and the filtering class according to the character data of the filtering class, and then lets the system process the text accordingly.

[0075] The working mode in which the disciplining system and the filtering system are installed separately makes the contents filter more flexible. The disciplining system is large and powerful, and all the parameters needed in filtering are generated in the disciplining system, while the filtering system is small and flexible and its processing speed is fast. It can therefore be installed conveniently in many types of software systems and hardware systems.

[0076] The filtering system communicates with the disciplining system through the standard data interface; the disciplining system offers support to the filtering system in multiple modes:

[0077] The contents filter builds a logical relation with the disciplining system through the character data of the filtering class, and the two can be physically separated. The user can meet different requirements either by downloading the character data of a standard filtering class from the technical support web site or by performing the disciplining with the disciplining system software.

[0078] The structure of the contents filter can be as follows: one disciplining system supports multiple filtering systems; or multiple disciplining systems support one filtering system; or multiple disciplining systems support multiple filtering systems.

[0079] Referring to FIG. 5, the disciplining system includes a module of classifying character vocabulary for contents filtering, an anti-interference extracting module of text character, an anti-interference extracting module of text subject matter and an evaluation and instruction module of disciplining effect.

[0080] Filtering is in fact a procedure of classification, but is stricter than classification. The filtering system defines typical distinguishing words as character words, and constructs a classifying character vocabulary used for contents filtering by compiling statistics on text containing one hundred million words; the vocabulary contains about 20000 entries.

[0081] The extraction of text character is to calculate the appearance frequency of character words and so on in the text according to the classifying character vocabulary for contents filtering. At present, in order to pass the key word filter, some unwelcome network information intentionally interferes with important words, for example “□□□” is written as “□#□#□”, or “□□□” is written as “□□□□”, so that the filter does not function. For the contents filter, this weakens the text content character. To address this situation, the present invention provides an anti-interference extracting method to implement the anti-interference extraction of the text character.

[0082] The extraction of text character is based on the classifying character vocabulary for contents filtering. The extracting procedure is the procedure of building the vector of the text character, and is the procedure by which the contents filter builds up the “filtering knowledge”.

[0083] Compared with the text character, the text subject matter indicates the classification of the text contents more concretely. Each filtering class builds up its set of subject matter words during the procedure of disciplining, and this set represents the most typical character of the contents.

[0084] The technology of evaluation and instruction gives the user an evaluation of the filtering effect and disciplining guidance based on the disciplining effect.

[0085] Referring also to FIG. 6, the filtering system includes:

[0086] 1. A classifying character vocabulary for contents filtering;

[0087] 2. Anti-interference extracting of text character;

[0088] 3. Calculating similarity between the text contents to be filtered and the defined filtering contents character;

[0089] The Vector Space Model is applied in the implementation of the contents filter system to calculate the vector similarity between the text contents to be filtered and the filtering class character.

[0090] The standard formula to calculate text similarity based on the Vector Space Model is as follows: $Sim\left( w_{i},\nu_{j} \right) = \cos\theta = \frac{\sum_{k = 1}^{n} w_{ik} \cdot \nu_{jk}}{\sqrt{\sum_{k = 1}^{n} w_{ik}^{2}} \cdot \sqrt{\sum_{k = 1}^{n} \nu_{jk}^{2}}},$

[0091] in this formula, W_(i) and V_(j) are the vector of the text to be analyzed and the standard vector, and w_(ik) and v_(jk) are the components of these vectors.
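
For illustration, the cosine similarity of the Vector Space Model can be computed as in the following minimal sketch. The function name is hypothetical, and returning 0 for a zero vector is a convention chosen for this sketch, since the formula itself is undefined in that case.

```python
import math


def cosine_similarity(w, v):
    """Sim(W_i, V_j) = cos(theta) between a text vector and a standard vector."""
    if len(w) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(w, v))
    norm_w = math.sqrt(sum(a * a for a in w))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_w == 0 or norm_v == 0:
        return 0.0  # convention of this sketch for an empty vector
    return dot / (norm_w * norm_v)


text_vector = [0.5, 0.0, 0.25]   # weights of three character words in the text
class_vector = [0.4, 0.3, 0.0]   # weights of the same words in the filtering class
print(cosine_similarity(text_vector, class_vector))
```

Both vectors are indexed by the same classifying character vocabulary, so component k of each vector refers to the same character word.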

[0092] 4. Calculating R_(is), the correlation between the text contents to be filtered and the defined filtering contents subject matter; rectifying the similarity by the correlation of subject matter words.

[0093] In each text there are some words that have a particular effect on the property of the class; these are named the subject matter words of the text. When a human being classifies intelligently, the special contribution of these subject matter words to the text class is considered and weighted. The subject matter words are either appointed in advance or extracted by the subject matter words extracting algorithm.

[0094] 5. Using the rectifying coefficient of precision P_(i) to rectify local similarity and short text similarity.

[0095] 6. After the degree of similarity is obtained by multi-step rectification, whether to filter the text to be filtered is determined according to the preset filtering threshold value U_(w).

[0096] S_(w,v), the degree of similarity by multi-step rectification, is obtained according to the following formula:

S_(w,v) = Sim(W_(i),V_(j))×P_(i)×R_(is)

[0097] If S_(w,v)≧U_(w), the contents filter will ask the system to filter the text.

[0098] If S_(w,v)<U_(w), the contents filter will consider the text safe and passable.
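
The final decision can be sketched as a single comparison, as shown below. The function name and the numeric values are hypothetical and serve only to illustrate the two outcomes.

```python
def filter_decision(sim, p_i, r_is, u_w):
    """Return (should_filter, s_wv) with s_wv = Sim(W_i, V_j) * P_i * R_is."""
    s_wv = sim * p_i * r_is
    return s_wv >= u_w, s_wv


# Strong subject matter overlap pushes a borderline similarity over the threshold.
blocked, score = filter_decision(sim=0.8, p_i=1.0, r_is=1.5, u_w=1.0)
print(blocked, score)

# A short, weakly matching text stays below the threshold and passes.
blocked, score = filter_decision(sim=0.5, p_i=0.6, r_is=1.0, u_w=0.5)
print(blocked, score)
```

Because the two coefficients multiply the cosine similarity, the threshold U_(w) has to be chosen on the scale of the rectified score rather than on the 0 to 1 scale of the cosine alone.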

[0099] As shown in FIG. 7, the subject matter words are the words that are significant and important in meaning and type for the specified text contents. The subject matter words set is larger than or equal to the key words set, and the subject matter words obtained by anti-interference filtering can be used in the key words filter or in other procedures based on the subject matter words.

[0100] The subject matter words set of text of a specific type can be appointed manually or obtained automatically; the method of obtaining it is independent of the present invention.

[0101] The method of anti-interference extracting of the subject matter words is as follows:

[0102] Consider one subject matter word W=a₁ a₂ . . . a_(n), in which a₁ . . . a_(n) is the sequence of characters of the subject matter word. While scanning the text S, if

a₁∈S, a₂∈S, . . . a_(n)∈S, and a₁<a₂< . . . <a_(n),

[0103] and if the number of characters between a₁ and a_(n) is less than the preset threshold distance D, and there is no punctuation between a₁ and a_(n), then there is an interferential subject matter word between a₁ and a_(n). Each time such a word string is found, the candidate frequency of the word is incremented, F′(W)++. When F′(W) reaches the preset threshold value F₀, all the interferential word strings are considered as the subject matter word W, and their influence is added when calculating the information of the corresponding subject matter word.

[0104] Here “<” stands for the precedence relation in the sequence (regardless of adjacency).

[0105] An embodiment is as follows:

[0106] The preset anti-interference distance of the contents filter is equal to 5, and the frequency threshold value of the interferential word is F₀=3.

[0107] Text i includes the subject matter word S, with S=a1 a2 a3 a4 a5.

[0108] According to preliminary analysis, the character string S′ is found between two neighboring punctuation marks:

[0109] S′=a1 x a2 x a3 a4 x a5, in which x stands for any character except punctuation.

[0110] Examining the relation between S′ and S with the anti-interference rule: a1<a2<a3<a4<a5 holds; the number of interfering characters between a1 and a5 is 3, less than the anti-interference distance D=5; and there is no punctuation between a1 and a5. The case therefore fits the condition, S′ is taken as equal to S, and S′ is considered as one of the subject matter words to be selected in text i. Then, if S′, or a variant of S′ in which the interferential characters x are in different locations, is found more than 3 times, it is concluded that S′ is an interferential word of S, i.e. the frequency F′(S) of the interferential word is larger than the threshold value F₀. Through the anti-interference processing of the subject matter word, S′ is thus considered to accord with the subject matter word of text i, and will be treated as the subject matter word in the contents filter.
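
The anti-interference rule of this embodiment can be sketched as follows; it is one possible reading rather than the definitive procedure. The names and the punctuation set are hypothetical. The sketch assumes that the interference distance counts only the inserted characters (3 in the embodiment above), that each match starts at an occurrence of the first character and takes the earliest in-order match of the remaining characters, and that only disguised variants are counted toward F′(W).

```python
import re

# Assumed delimiter set; "#" is deliberately left out because it is a typical
# interfering character rather than a sentence delimiter.
PUNCTUATION = "，。！？；：、,.!?;:"
_SEGMENT_SPLIT = re.compile("[" + re.escape(PUNCTUATION) + "]")


def find_disguised(word, text, max_distance):
    """Return the disguised variants of `word` found between punctuation marks."""
    hits = []
    if len(word) < 2:
        return hits
    for segment in _SEGMENT_SPLIT.split(text):
        i = 0
        while i < len(segment):
            if segment[i] != word[0]:
                i += 1
                continue
            # Match the remaining characters in order, as early as possible.
            pos = i
            for ch in word[1:]:
                pos = segment.find(ch, pos + 1)
                if pos == -1:
                    break
            if pos == -1:
                i += 1
                continue
            inserted = (pos - i + 1) - len(word)  # interfering characters
            if 0 < inserted < max_distance:
                hits.append(segment[i:pos + 1])
                i = pos + 1  # continue after this match
            else:
                i += 1
    return hits


def is_interfered_subject_word(word, text, max_distance=5, min_count=3):
    """True when F'(W) reaches the threshold F0 (here min_count)."""
    return len(find_disguised(word, text, max_distance)) >= min_count


sample = "axbxcdxe, ok axbcxde. a#bcde!"
print(find_disguised("abcde", sample, max_distance=5))
print(is_interfered_subject_word("abcde", sample))
```

In the sample, the first variant has the same shape as S′ in the embodiment (three inserted characters), and a comma placed inside the word would split it into separate segments so that no match is reported.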

[0111] As shown in FIG. 8, the ordinary method of calculating the similarity of text based on the vector space is as follows: $\mathrm{Sim}(W_{i},V_{j}) = \cos\theta = \frac{\sum_{k=1}^{n} w_{ik} \cdot v_{jk}}{\sqrt{\sum_{k=1}^{n} w_{ik}^{2}} \cdot \sqrt{\sum_{k=1}^{n} v_{jk}^{2}}}$

[0112] In the above formula, W_(i) and V_(j) are the vector of the text to be analyzed and the standard vector respectively, and w_(ik) and v_(jk) are the components of the respective vectors. The formula shows that all words are treated equally when the degree of similarity is calculated.
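As a minimal sketch of the above cosine formula, the following Python function computes Sim(W_(i),V_(j)) for two vectors stored as term-to-weight dictionaries. The dictionary representation, the function name and the convention of returning 0 for an empty vector are assumptions of the sketch.

```python
import math


def cosine_similarity(w, v):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in w.items())
    norm_w = math.sqrt(sum(x * x for x in w.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_w == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for an empty vector
    return dot / (norm_w * norm_v)


print(cosine_similarity({"a": 3, "b": 4}, {"a": 3}))  # 9 / (5 * 3)
```

Every term enters the sum with its plain weight, which is the equal treatment of words noted in the description.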

[0113] Besides the character words, there exist some special words in each class of text which make a specific contribution to the class to which the text belongs; these special words are called character words or subject matter words. When human beings classify text intelligently, the special contribution of these subject matter words is considered and weighted toward the text class.

[0114] Based on this thought, in order to make the similarity result more effective and natural, a method of extracting the subject matter words is set, and the said standard method is rectified according to the extracted subject matter words.

[0115] Before the rectification on the subject matter words, the first step is to extract the subject matter words of the specific class. In detail, when the specific text is analyzed and the character vector of the text is extracted, the subject matter words are extracted with overall consideration of the frequency of the word, the concentration of frequency and the word length. A concrete method is as follows: $w_{ik} = \underbrace{\left(K_{1} + K_{1} \times \frac{tf}{MAXtf}\right)}_{\text{①}} \times \underbrace{\frac{1}{\log_{2}\frac{T_{w}}{tf}}}_{\text{②}} \times \underbrace{\left(K_{2} + K_{2} \times \frac{W_{l}}{MAXW_{l}}\right)}_{\text{③}}$

[0116] in which ① stands for the factor of the frequency of the word; ② stands for the factor of the concentration of frequency; ③ stands for the factor of word length; w_(ik) stands for the weight of the word k in text i; tf stands for the frequency of the word k in text i; MAXtf stands for the frequency of the word with maximum frequency; K₁ stands for the grade of importance attached to tf, commonly set to 0.5; W_(l) stands for the length of the word; MAXW_(l) stands for the maximum value of the word length in the text; K₂ stands for the grade of importance attached to W_(l), commonly set to 0.5; T_(w) stands for the total amount of words (considering the character words only).

[0117] In the procedure of disciplining, a group of words with the maximum values is extracted as the standard subject matter word set. When the text to be analyzed is dealt with, the subject matter word set of that text is calculated with the same formula, and the subject matter words are rectified according to the said two subject matter word sets.

[0118] The embodiment is as follows:

[0119] Determine whether a character word W belongs to the subject matter words of text i.

[0120] The total amount of character words in text i is T_(w)=100, the maximum word frequency MAXtf is equal to 6, the maximum word length MAXW_(l) is equal to 5, and there is a character word W in the text whose word length W_(l) is equal to 3 and whose frequency tf in the text is 5.

[0121] Setting K₁=K₂=0.5.

[0122] Calculating the weight of the character word W in text i using the subject matter extracting formula gives $w_{ik} = \left(0.5 + \frac{0.5 \times 5}{6}\right) \times \frac{1}{\log_{2}\frac{100}{5}} \times \left(0.5 + 0.5 \times \frac{3}{6}\right) \approx 0.159.$

[0123] Repeating the said steps, the weights of all one hundred character words in text i can be calculated, and the character words are then arranged in order of weight. If ten subject matter words are to be extracted from text i, the ten character words with the highest weights are chosen as the subject matter words of the text; if the weight w_(ik) of the word W meets this condition, the word W is a subject matter word of text i.
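The extracting formula and the top-ten selection can be sketched in Python as follows. The function and parameter names are assumptions of the sketch. Note that the printed computation in [0122] uses 3/6 in the word-length factor and yields about 0.159, whereas the value MAXW_(l)=5 stated in [0120] would give 3/5 and about 0.170; the sketch takes the maximum word length as a parameter so that either reading can be evaluated.

```python
import math


def subject_weight(tf, max_tf, total_words, word_len, max_word_len, k1=0.5, k2=0.5):
    """Weight w_ik = (frequency factor) x (concentration factor) x (length factor)."""
    if tf >= total_words:
        # log2(T_w / tf) would be zero or negative; the source does not cover this case.
        raise ValueError("tf must be smaller than the total amount of words")
    frequency = k1 + k1 * tf / max_tf
    concentration = 1.0 / math.log2(total_words / tf)
    length = k2 + k2 * word_len / max_word_len
    return frequency * concentration * length


def top_subject_words(weights, n=10):
    """Return the n words with the highest weights (weights: dict word -> weight)."""
    return sorted(weights, key=weights.get, reverse=True)[:n]


print(round(subject_weight(5, 6, 100, 3, 6), 3))  # 0.159, as printed in [0122]
print(round(subject_weight(5, 6, 100, 3, 5), 3))  # 0.17, with MAXW_l = 5 from [0120]
```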

[0124] When the similarity of the text to be analyzed is calculated, the subject matter rectifying coefficient is adjusted according to the degree of overlapping between the subject matter word set of the text to be analyzed and the standard subject matter word set, following the idea of subject matter rectifying.

[0125] The formula for subject matter words rectifying is as follows: $R_{is} = A + \frac{T_{is} \cap C_{s}}{C_{s}}$

[0126] in which A is an experience value (0<A<1), generally set to 0.7, which reflects the degree of importance paid to the subject matter words; R_(is) is the correlation coefficient of the subject matter words, in the range from A to A+1; T_(is) is the subject matter words amount of the text to be analyzed; C_(s) is the subject matter words amount of the standard class. “∩” stands for the calculation of intersection, namely it determines the amount of T_(is) that is contained in C_(s); the calculation of intersection is independent of the order of the subject matter words.

[0127] The coefficient of the subject matter words aims at determining the similarity of contents by the degree of overlapping of the subject matter words. As shown in the above formula, when the overlapping of the subject matter words, i.e. $\frac{T_{is} \cap C_{s}}{C_{s}},$

[0128] the ratio of the subject matter words of the text to be analyzed to the standard subject matter words, is larger than 1−A, R_(is) is more than 1 and the similarity of the character vector will be strengthened; on the contrary, if R_(is) is less than 1, the similarity of the character vector will be weakened.

[0129] The method of the present invention aims at rectifying the similarity of text based on the Vector Space Model by subject matter words rectification, as follows:

[0130] The correlation degree between the text to be analyzed and the standard text is equal to Sim(W_(i),V_(j))×R_(is), in which R_(is) is the correlation rectifying coefficient of the subject matter words.

[0131] An embodiment is:

[0132] There is one filtering class T, which has a subject matter word set

[0133] Subj_T={S₁,S₂,S₃,S₄,S₅,S₆,S₇,S₈,S₉,S₁₀}

[0134] The degree of similarity between text i and the filtering class T, which is calculated by the Vector Space Model, is Sim(i,T), and the subject matter word set of text i is obtained through the subject matter words extracting:

[0135] Subj_i={i₁,i₂,i₃,i₄,i₅,i₆,i₇,i₈,i₉,i₁₀}.

[0136] Calculate the intersection of Subj_T and Subj_i, i.e. determine the number of elements S_(i) that are equal to some i_(k).

[0137] 1) If the size of Subj_T∩Subj_i is equal to 7 and A is set to 0.7, the rectifying value of the subject matter words is $R_{is} = 0.7 + \frac{T_{is} \cap C_{s}}{C_{s}} = 0.7 + \frac{7}{10} = 1.4,$

[0138] then the text similarity from the VSM model is rectified by R_(is).

[0139] The correlation degree between text i to be analyzed and class T is equal to Sim(i,T)×R_(is)=1.4×Sim(i,T). The similarity of the text is rectified to a larger value, which shows that the high subject matter correlation between text i and the filtering class T strengthens the similarity of the text contents.

[0140] 2) If the size of Subj_T∩Subj_i is equal to 1 and A is set to 0.7, the rectifying value of the subject matter words is $R_{is} = 0.7 + \frac{T_{is} \cap C_{s}}{C_{s}} = 0.7 + \frac{1}{10} = 0.8.$

[0141] The text similarity from the VSM model is rectified by R_(is).

[0142] The correlation degree between text i to be analyzed and class T is equal to Sim(i,T)×R_(is)=0.8×Sim(i,T). The similarity of the text is rectified to a smaller value, which shows that the departure of the subject matter of text i from the filtering class T weakens the similarity of the text contents.
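The two cases of this embodiment can be reproduced with the following Python sketch. The function names and the placeholder word labels are assumptions made for illustration; the sets stand for Subj_T and Subj_i.

```python
def rectifying_coefficient(text_words, class_words, a=0.7):
    """R_is = A + |T_is ∩ C_s| / |C_s|, independent of word order."""
    class_set = set(class_words)
    if not class_set:
        raise ValueError("the standard subject matter word set must not be empty")
    overlap = len(set(text_words) & class_set)
    return a + overlap / len(class_set)


def rectified_similarity(sim, text_words, class_words, a=0.7):
    """Correlation degree Sim x R_is."""
    return sim * rectifying_coefficient(text_words, class_words, a)


subj_t = [f"S{n}" for n in range(1, 11)]                  # standard class, 10 words
seven_shared = subj_t[:7] + ["x1", "x2", "x3"]            # case 1): 7 words overlap
one_shared = subj_t[:1] + [f"y{n}" for n in range(1, 10)]  # case 2): 1 word overlaps

print(round(rectifying_coefficient(seven_shared, subj_t), 2))  # 1.4
print(round(rectifying_coefficient(one_shared, subj_t), 2))    # 0.8
```

With A=0.7 the coefficient exceeds 1 only when more than three of the ten standard words are shared, which is the 1−A condition stated earlier.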

[0143] As shown in FIG. 9, the process of evaluating the disciplining effect includes adopting the appointed disciplining text, extracting the class character through disciplining, then expressing the text contents, and finally submitting them to the filter to conduct the filtering operation.

[0144] The evaluation of disciplining effect includes three aspects: the evaluation of the character words amount, the evaluation of the rate of repeat of character words and the evaluation of the degree of subject matter centralization. When the quantity of disciplining reaches an amount such as 100K, 200K etc. (such a point is called a disciplining evaluation point), the result of the evaluation of disciplining effect is derived according to the evaluation coefficients set out below.

[0145] In detail, the coefficient of the evaluation of character words amount is obtained as follows:

[0146] Because the character words reflect the main contents of the linguistic elements, the smaller the amount of character words in the disciplining text is, the more centralized the linguistic elements are, so a coefficient of character words amount is set.

[0147] The character words amount in the disciplining text is x_(i), and the total amount of the character words is A. The threshold value α_(i) is set according to experience. The formula for Q₁ is as follows: $Q_{1} = \begin{cases} 1 & \text{when } x_{i} < \alpha_{i} \\ \frac{A - x_{i}}{A - \alpha_{i}} & \text{when } x_{i} > \alpha_{i} \end{cases}$

[0148] According to experience, α_(i) of each evaluation point is as follows:

Quantity of disciplining:   100 k   200 k   300 k   400 k
α_(i):                      2500    3400    4200    4800

[0149] The coefficient of the evaluation of the rate of repeat of character words is as follows:

[0150] Because the character words reflect the main contents of the linguistic elements, the higher the rate of repeat of character words in the disciplining text is, the more centralized the linguistic elements are; therefore a coefficient of the evaluation of the rate of repeat of character words is set.

[0151] At the disciplining evaluation point i, the character words are extracted from the disciplining text of group i and compared with the set of character words from the disciplining text of the preceding i−1 groups, and the rate of repeat of character words is calculated. If the mean rate of repeat is x_(i) and the experience threshold value is β, then the formula of Q₂ is as follows: $Q_{2} = \begin{cases} x_{i}/\beta & \text{when } x_{i} < \beta \\ 1 & \text{when } x_{i} > \beta \end{cases}$ with β=0.4.

[0152] Furthermore, the coefficient of the evaluation of the degree of subject matter centralization is obtained as follows:

[0153] If the subject matter of the disciplining linguistic elements is relatively centralized, most of the linguistic elements will talk about the same topic. According to this thought, a coefficient of the evaluation of the degree of subject matter centralization is set.

[0154] At the disciplining evaluation point i, x_(i), the maximum document overlapping rate of the top n high frequency character words, is extracted from the disciplining linguistic elements of group i, and the experience threshold value χ is set. The formula of Q₃ is as follows: $Q_{3} = \begin{cases} x_{i}/\chi & \text{when } x_{i} < \chi \\ 1 & \text{when } x_{i} > \chi \end{cases}$

[0155] The experience value is: χ=0.8, n=50.

[0156] Finally the formula of disciplining effect is:

Q=Q₁*Q₂*Q₃ or Q=Q₁*Q₂ or Q=Q₁*Q₃ or Q=Q₁ or Q=Q₂ or Q=Q₃

[0157] According to the value of Q, the grade of disciplining effect is determined.

Q:                       0-0.2   0.2-0.4   0.4-0.6   0.6-0.8   0.8-1.0

[0158] Grade of effect:  worst   bad       normal    good      best
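The three coefficients and the grading can be sketched in Python as follows. The function names are assumptions of the sketch. Q₂ and Q₃ share one clipped-ratio form (the formula of Q₃ is the one recited in claim 36). Two further assumptions are made because the source leaves them open: at equality with the threshold each coefficient is taken as 1, and each grade interval includes its lower bound.

```python
def q1(x, alpha, total):
    """Character words amount coefficient; `total` is A and must exceed alpha."""
    if x <= alpha:
        return 1.0
    return (total - x) / (total - alpha)


def clipped_ratio(x, threshold):
    """Shared form of Q2 (threshold beta) and Q3 (threshold chi)."""
    if x >= threshold:
        return 1.0
    return x / threshold


def grade(q):
    """Map Q in [0, 1] to the grade of disciplining effect."""
    for upper, name in ((0.2, "worst"), (0.4, "bad"), (0.6, "normal"), (0.8, "good")):
        if q < upper:
            return name
    return "best"


# Hypothetical inputs at the 100 k evaluation point (alpha = 2500 from the table).
q = q1(4000, 2500, 10000) * clipped_ratio(0.3, 0.4) * clipped_ratio(0.8, 0.8)
print(round(q, 2), grade(q))  # 0.6 good
```

The total of 10000 character words and the two rates are invented inputs; only the thresholds 2500, β=0.4 and χ=0.8 come from the description.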

[0159] The said result can be used to guide the disciplining system of the filter so as to improve the disciplining effect.

[0160] Comparison with concrete examples is shown as follows:

[0161] In this experiment, several kinds of disciplining text with a high degree of concentration were used, together with some farraginous (miscellaneous) texts extracted from a multipurpose network for contrast, and the said method was used to verify the effect of disciplining. The result is as follows:

[0162] Effect of the texts with better concentration:

Quantity of disciplining:   100 k   200 k   300 k   400 k
Q₁                          1       1       1       1
Q₂                          1       1       1       1
Q₃                          1       1       1       1
Q                           1       1       1       1

[0163] Effect of a group of farraginous texts:

Quantity of disciplining:   100 k   200 k   300 k   400 k
Q₁                          0.95    0.9     0.86    0.85
Q₂                          1       0.8     0.7     0.75
Q₃                          0.85    0.67    0.65    0.35
Q                           0.80    0.48    0.39    0.22

[0164] Obviously, the disciplining effect of the farraginous texts is far below that of the texts with better concentration.
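As a consistency check, the Q rows of the two tables can be recomputed as Q=Q₁*Q₂*Q₃ with the short Python sketch below. The dictionary layout and function names are assumptions; the numbers are copied from the tables. The printed Q values agree with the products to within 0.01.

```python
# Each entry: (Q1, Q2, Q3, printed Q) at one disciplining evaluation point.
concentrated = {
    "100 k": (1, 1, 1, 1),
    "200 k": (1, 1, 1, 1),
    "300 k": (1, 1, 1, 1),
    "400 k": (1, 1, 1, 1),
}
farraginous = {
    "100 k": (0.95, 1.0, 0.85, 0.80),
    "200 k": (0.90, 0.80, 0.67, 0.48),
    "300 k": (0.86, 0.70, 0.65, 0.39),
    "400 k": (0.85, 0.75, 0.35, 0.22),
}


def overall_effect(q1, q2, q3):
    """Q = Q1 * Q2 * Q3."""
    return q1 * q2 * q3


def max_deviation(table):
    """Largest gap between the recomputed product and the printed Q."""
    return max(abs(overall_effect(a, b, c) - printed) for a, b, c, printed in table.values())


for point, (a, b, c, printed) in farraginous.items():
    print(point, round(overall_effect(a, b, c), 4), printed)
```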

[0165] The concept of the Vector Space Model is that a text is considered as a group of vocabulary entries, each of which is given a corresponding weight according to its importance. A vector space is then constructed, and each text can be expressed in the Vector Space Model as a series of pairs, each consisting of a vocabulary entry and a weight, as follows:

TW=((t₁,W₁),(t₂,W₂), . . . ,(t_(n),W_(n)))

[0166] Consequently the problem of matching text contents is transformed into the calculation of vector correlation in the vector space.

[0167] The standard formula of similarity of text based on the Vector Space Model is as follows: $\mathrm{Sim}(W_{i},V_{j}) = \cos\theta = \frac{\sum_{k=1}^{n} w_{ik} \cdot v_{jk}}{\sqrt{\sum_{k=1}^{n} w_{ik}^{2}} \cdot \sqrt{\sum_{k=1}^{n} v_{jk}^{2}}}$

[0168] in which W_(i) and V_(j) are the text vector to be analyzed and the standard vector respectively, and w_(ik) and v_(jk) are the components of the corresponding vectors. The function of the said formula is to calculate the similarity of W_(i) and V_(j).

[0169] In practice, this formula has the following problem: a text to be analyzed which does not belong to class V_(j) may obtain a high similarity because it contains some of the high weight words of the standard vector V_(j). This is abnormal and is a defect of the method. The problem is especially prominent when the text to be analyzed contains few character words, but those character words have high weights.

[0170] In the process of intelligent classifying, a text to be analyzed would not be classified to V_(j) merely because it contains some high weight words; rather, the similarity of a text of this kind would be reduced.

[0171] Therefore, a method of rectification based on the precision of similarity is introduced to make the result more effective and natural. The method is as follows:

[0172] The correlation degree between the text i to be analyzed and the standard text is equal to Sim(W_(i),V_(j))×P_(i), in which P_(i) stands for the rectifying coefficient of precision.

[0173] The concept of precision is as follows:

[0174] P_(i) is a degree value which represents how precisely the text to be analyzed belongs to the standard class; it is called the precision (of similarity).

[0175] The formula to calculate it is as follows: $P_{i} = B\sqrt{\frac{\sum\left(\sigma_{k} v_{jk}\right)^{2}}{\sum\left(v_{jk}\right)^{2}}}$, in which $B \geq 1$ and $\sigma_{k} = \begin{cases} 1 & \text{when } w_{jk} > 0 \\ 0 & \text{when } w_{jk} = 0 \end{cases}$

[0176] B is an experience value which stands for the importance attached to the precision information. When P_(i) is more than 1, the similarity of the character vector is strengthened; otherwise the similarity of the character vector is weakened.

[0177] An embodiment is shown as follows:

[0178] A class of text T can be represented by

T={(t₁,100),(t₂,100),(t₃,50),(t₄,50),(t₅,10), . . . ,(t₂₀,10)} (in which t_(i) are the character words).

[0179] For a text to be analyzed, M, the character vector model obtained after processing is: M={(t₁,100),(t₂,100)}.

[0180] According to the vector M to be analyzed, the vector T of the text class is rectified, and the calculation of text similarity in the Vector Space Model gives: Sim(T,M)=0.87.

[0181] On the surface, the result indicates that texts M and T are highly similar, while actually text M reflects only a local part of class T; just the local part is highly similar. Calculating the similarity in the Vector Space Model can handle the problems of local similarity and short text similarity, but it is unnatural that the similarity is raised by a few high weight words.

[0182] Adding the rectification of precision and letting B equal 1, P_(i) is equal to 0.8, so the similarity is reduced and the result is more natural. This method has particular influence near the threshold value used to determine the class to which a text belongs: a text whose similarity is slightly higher than the threshold value has its similarity reduced below the threshold value.
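The embodiment can be evaluated with the following Python sketch, offered for illustration only. The function names and the dictionary representation are assumptions. The sketch further assumes that t₅ to t₂₀ all carry the weight 10 and that σ_(k) is 1 exactly when the text to be analyzed has a non-zero weight for term k. Under these assumptions the printed formulas give Sim(T,M)≈0.87, in line with [0180], and P_(i)≈0.87 with B=1; the description quotes P_(i)=0.8, and the source does not show how that figure was obtained.

```python
import math


def cosine_similarity(w, v):
    """Cosine similarity of two sparse term-weight vectors (dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in w.items())
    norm_w = math.sqrt(sum(x * x for x in w.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_w * norm_v) if norm_w and norm_v else 0.0


def precision(text_vec, class_vec, b=1.0):
    """P_i = B * sqrt(sum((sigma_k * v_jk)^2) / sum(v_jk^2))."""
    covered = sum(
        weight ** 2 for term, weight in class_vec.items() if text_vec.get(term, 0) > 0
    )
    total = sum(weight ** 2 for weight in class_vec.values())
    return b * math.sqrt(covered / total)


T = {"t1": 100, "t2": 100, "t3": 50, "t4": 50}
T.update({f"t{k}": 10 for k in range(5, 21)})  # assumed: t5..t20 all weigh 10
M = {"t1": 100, "t2": 100}

sim = cosine_similarity(M, T)
p = precision(M, T, b=1.0)
print(round(sim, 2), round(p, 2), round(sim * p, 2))  # 0.87 0.87 0.75
```

Because M covers only two of the twenty terms of T, the precision factor pulls the rectified correlation degree below the raw cosine value, which is the effect the embodiment describes.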

1. A contents filter based on similarity of content character and correlation of subject matter, characterized in that the contents filter includes at least a filtering system and a disciplining system; the disciplining system learns from the appointed information to obtain the filtering character of the information; the filtering system filters the information, and the disciplining system communicates with the filtering system.
2. The contents filter based on similarity of content character and correlation of subject matter according to claim 1, characterized in that the said contents filter includes at least one disciplining system and one or more filtering systems.
3. The contents filter based on similarity of content character and correlation of subject matter according to claim 1, characterized in that the said contents filter includes at least one filtering system and one or more disciplining systems.
4. The contents filter based on similarity of content character and correlation of subject matter according to claim 1, characterized in that the said contents filter includes a plurality of filtering systems and a plurality of disciplining systems.
5. The contents filter based on similarity of content character and correlation of subject matter according to claim 1, characterized in that the filtering system and the disciplining system are installed separately, and the filtering system communicates with the disciplining system through the data interface.
6. The contents filter based on similarity of content character and correlation of subject matter according to claim 5, characterized in that the said separate installation means physically separate installation.
7. The contents filter based on similarity of content character and correlation of subject matter according to claim 6, characterized in that the filtering system can be installed in an input device of network information.
8. The contents filter based on similarity of content character and correlation of subject matter according to claim 7, characterized in that the different filtering characters obtained by the disciplining system are configured to the filtering systems located in the different input devices of network information.
9. The contents filter based on similarity of content character and correlation of subject matter according to claim 8, characterized in that the said configuration is to distribute the filtering character of the filtering system according to the load capacity, the location and the purpose of the input device of network information in the network.
10. The contents filter based on similarity of content character and correlation of subject matter according to claim 7, characterized in that the said input device of network information is a firewall.
11. The contents filter based on similarity of content character and correlation of subject matter according to claim 8, characterized in that the said input device of network information is a mail server.
12. The contents filter based on similarity of content character and correlation of subject matter according to claim 8, characterized in that the said input device of network information is a proxy server.
13. The contents filter based on similarity of content character and correlation of subject matter according to claim 8, characterized in that the said input device of network information is a personal computer.
14. The contents filter based on similarity of content character and correlation of subject matter according to claim 7, characterized in that the said input device of network information is one input device of network information, or more input devices of network information, or a combination of any types of input device of network information.
15. The contents filter based on similarity of content character and correlation of subject matter according to claim 1, characterized in that the said disciplining system includes a module of classifying character vocabulary for contents filtering, which is used to construct a classifying character vocabulary learned from the special information, and to conduct the supplement or update of the filtering system.
16. The contents filter based on similarity of content character and correlation of subject matter according to claim 1, characterized in that the said disciplining system includes an anti-interference extracting module of text character for contents filtering, which is used to examine and obtain the interferential text in the checked information, and thereby to instruct the text filtering of the filtering system.
17. The contents filter based on similarity of content character and correlation of subject matter according to claim 1, characterized in that the said disciplining system further includes an anti-interference extracting module of text subject matter.
18. The contents filter based on similarity of content character and correlation of subject matter according to claim 17, characterized in that the procedure by which the said anti-interference extracting module of text subject matter extracts the anti-interference subject matter words includes the steps as follows: Step 1: the anti-interference extracting module of text subject matter examines the specified characters in the checked text, to determine whether the sequence of the specified characters accords with the sequence of the characters in the preset subject matter words, i.e. finds the specified character string; Step 2: the anti-interference extracting module of text subject matter determines the interferential distance, and if the interferential distance is less than the preset threshold distance, the string is considered as a candidate interferential subject matter word; Step 3: when the anti-interference extracting module of text subject matter concludes that the frequency of appearance of the mentioned subject matter words is beyond the preset threshold value, the candidate subject matter words are set as key words of the filter.
19. The contents filter based on similarity of content character and correlation of subject matter according to claim 18, characterized in that the examining of the specified characters by the said anti-interference extracting module of text subject matter further includes finding whether there is Chinese punctuation among the specified characters; if the specified characters do not contain Chinese punctuation, the character string is considered as the interferential subject matter words, and the anti-interference extracting module of text subject matter sets the character string as key words of the filter.
20. The contents filter based on similarity of content character and correlation of subject matter according to claim 18, characterized in that the said step 1 can be directly implemented as follows: the anti-interference extracting module of text subject matter examines the specified character string between two adjacent punctuation marks.
21. The contents filter based on similarity of content character and correlation of subject matter according to claim 18, characterized in that the said frequency of appearance of the candidate interferential subject matter word can be the summation over interferential subject matter words of several different types.
22. The contents filter based on similarity of content character and correlation of subject matter according to claim 18, characterized in that the said anti-interference extracting module of text subject matter is used to extract the information relating to text subject matter, to rectify the extracted information, and then to rectify the similarity of text based on the Vector Space Model with the rectified result of the subject matter information.
23. The contents filter based on similarity of content character and correlation of subject matter according to claim 22, characterized in that rectifying the similarity of text based on the vector space according to the rectified result of the subject matter information includes the steps as follows: Step 1: the anti-interference extracting module of text subject matter extracts the information relating to subject matter of text; Step 2: the anti-interference extracting module of text subject matter rectifies the result of similarity of text based on the Vector Space Model.
24. The contents filter based on similarity of content character and correlation of subject matter according to claim 23, characterized in that before the said step 2, the steps also include: the extracted information relating to subject matter is rectified, then the similarity of text based on the Vector Space Model is rectified.
25. The contents filter based on similarity of content character and correlation of subject matter according to claim 23, characterized in that the information relating to text subject matter extracted by the said anti-interference extracting module of text subject matter in step 1 is the frequency of word, the concentration of frequency, the information of word length, the words, and the total words amount; and the information relating to subject matter is the information relating to subject matter with top weight after weighting.
26. The contents filter based on similarity of content character and correlation of subject matter according to claim 25, characterized in that the extracting of the information relating to subject matter by the anti-interference extracting module of text subject matter is performed with the formula as follows: $w_{ik} = \underbrace{\left(K_{1} + \frac{K_{1} \times tf}{MAXtf}\right)}_{\text{①}} \times \underbrace{\frac{1}{\log_{2}\frac{T_{w}}{tf}}}_{\text{②}} \times \underbrace{\left(K_{2} + K_{2} \times \frac{w_{l}}{MAXw_{l}}\right)}_{\text{③}}$

wherein ① stands for the factor of frequency of word; ② stands for the factor of the concentration of frequency; ③ stands for the factor of word length; w_(ik) stands for the weight of the word in the text i; tf stands for the frequency of word k in the text i; MAXtf stands for the word frequency of the word with maximum frequency; K₁ stands for the grade of importance to tf, commonly set to 0.5; MAXw_(l) stands for the maximum value of the word length in the text; K₂ stands for the grade of importance to w_(l), commonly set to 0.5; T_(w) stands for the amount of words which only include character words.
27. The contents filter based on similarity of content character and correlation of subject matter according to claim 25, characterized in that the rectifying of the extracted information relating to subject matter by the anti-interference extracting module of text subject matter is that the similarity of contents is determined by the degree of overlapping of subject matter information.
28. The contents filter based on similarity of content character and correlation of subject matter according to claim 24, characterized in that the rectification of the similarity of text based on the Vector Space Model is as follows: if the degree of overlapping is more than the threshold value, the value of eigenvector similarity will be strengthened, and if the degree of overlapping is less than the threshold value, the value of eigenvector similarity will be weakened.
29. The contents filter based on similarity of content character and correlation of subject matter according to claim 24, characterized in that the rectification of the information relating to subject matter is performed with the following formula: $R_{is} = A + \frac{T_{is} \cap C_{s}}{C_{s}}$

wherein A is an experiential value reflecting the degree of importance paid to the subject matter word (0<A<1); R_(is) is a correlation coefficient of the subject matter word; T_(is) is the subject matter words amount of the text to be analyzed; C_(s) is the subject matter words amount of standard class; “∩” stands for calculation of intersection.
30. The contents filter based on similarity of content character and correlation of subject matter according to claim 28, characterized in that the rectification of the similarity of text based on the Vector Space Model is: Sim(W_(i),V_(j))×R_(is), in which Sim(W_(i),V_(j)) is the similarity of text based on the Vector Space Model, and R_(is) is a correlation coefficient of the subject matter word.
31. The contents filter based on similarity of content character and correlation of subject matter according to claim 23, characterized in that the said information relating to subject matter is the subject matter words or the character words.
32. The contents filter based on similarity of content character and correlation of subject matter according to claim 1, characterized in that the said disciplining system further includes an evaluation and instruction module of disciplining effect.
33. The contents filter based on similarity of content character and correlation of subject matter according to claim 32, characterized in that the said evaluation and instruction module of disciplining effect is used to obtain the coefficients of the evaluation of the character words amount, the evaluation of the rate of repeat and the evaluation of the degree of subject matter centralization; then, according to these coefficients, the result of disciplining effect is derived to give an objective and quantitative instruction to disciplining.
34. The contents filter based on similarity of content character and correlation of subject matter according to claim 32, characterized in that the evaluation of the character words amount is executed according to the following formula: $Q_{1} = \begin{cases} 1 & \text{when } x_{i} < \alpha_{i} \\ \frac{A - x_{i}}{A - \alpha_{i}} & \text{when } x_{i} > \alpha_{i} \end{cases}$

wherein x_(i) stands for the character words amount in the text of disciplining, A stands for the total amount of the character words, and α_(i) is an experiential threshold value of the character words amount for each disciplining evaluation point.
35. The contents filter based on similarity of content character and correlation of subject matter according to claim 33, characterized in that the evaluation of the rate of repeat is executed according to the following formula: $Q_{2} = \begin{cases} x_{i}/\beta & \text{when } x_{i} < \beta \\ 1 & \text{when } x_{i} > \beta \end{cases}$

in which x_(i) stands for the mean rate of repeat, and β is an experiential threshold value.
36. The contents filter based on similarity of content character and correlation of subject matter according to claim 33, characterized in that the evaluation of the degree of subject matter centralization is executed according to the following formula: $Q_{3} = \begin{cases} x_{i}/\chi & \text{when } x_{i} < \chi \\ 1 & \text{when } x_{i} > \chi \end{cases}$

in which x_(i) stands for the maximum overlapping rate of document, and χ is an experiential threshold value.
37. The contents filter based on similarity of content character and correlation of subject matter according to claim 34, characterized in that the evaluation of disciplining effect is executed according to the following formula: Q=Q₁*Q₂*Q₃ or Q=Q₁*Q₂ or Q=Q₁*Q₃ or Q=Q₁ or Q=Q₂ or Q=Q₃; then, according to the value of Q, the grade of disciplining effect is determined.
38. The contents filter based on similarity of content character and correlation of subject matter according to claim 1, characterized in that the said filtering system includes a module of classifying character vocabulary for contents filtering, an anti-interference extracting module of text character, and a module of calculating similarity between text contents to be filtered and defined filtering contents.
39. The contents filter based on similarity of content character and correlation of subject matter according to claim 1, characterized in that the said filtering system includes a module of rectifying local similarity and short text similarity with precision.
40. The contents filter based on similarity of content character and correlation of subject matter according to claim 39, characterized in that the said module of rectifying local similarity and short text similarity with precision is used to obtain the precision with which the text to be analyzed belongs to the standard class according to the standard vector of the text to be analyzed, and to use the said precision to rectify the result of the similarity of text based on the Vector Space Model.
41. The contents filter based on similarity of content character and correlation of subject matter according to claim 40, characterized in that the said rectification can be Sim(W_(i),V_(j))×P_(i), in which P_(i) stands for the rectifying coefficient of precision, and Sim(W_(i),V_(j)) is the similarity of text based on the Vector Space Model.
42. The contents filter based on similarity of content character and correlation of subject matter according to claim 41, characterized in that the rectifying coefficient of precision can be obtained through the following formula: $P_{i} = B\sqrt{\frac{\sum\left(\sigma_{k} v_{jk}\right)^{2}}{\sum\left(v_{jk}\right)^{2}}}$

in which $\sigma_{k} = \begin{cases} 1 & \text{when } w_{jk} > 0 \\ 0 & \text{when } w_{jk} = 0 \end{cases}$ and B≧1 is an experiential value of the grade of importance to the precision information.
 43. The contents filter based on similarity of contentcharacter and correlation of subject matter according to claim 1, ischaracterized that the said filtering system includes a filtering moduleaccording to multi-step rectified degree of similarity.
 44. The contentsfilter based on similarity of content character and correlation ofsubject matter according to claim 43, is characterized that the saidfiltering module according to multi-step rectified degree of similarityis used to gather the coefficients of precision obtained by each module,with the preset filtering threshold value U_(w) to determine whether thetext to be filtered should be filtered.